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SEMANTICALLY REDUCING THE NUMBER OF PARTITIONS INVOLVED IN A JOIN 

Background 

[0001] Relational database systems allow data to be stored in tables that are organized as a set of 
columns and rows. Standard commands are used to define the columns and rows of tables, and data is 
5 subsequently entered in accordance with the defined structure. The defined table structure is logically 
maintained, but may not correspond to the physical organization of the data. For example, the data 
corresponding to a particular table may be divided among a number of physical hardware storage 
facilities. 

[0002] Users of relational database systems require the minimum time possible for execution of 
complex queries against a large database. The organization of data corresponding to tables defined in 
a relational database system may influence the time required to execute a common query. If the data is 
properly organized, performance can be improved by searching only a part of the data for queries that 
can take advantage of that organization. 

[0003] Given a particular change in the organization of data, such as partitioning, particular types of 
searches may be adversely impacted in terms of efficiency if they are performed without any 
adjustment. Many factors may influence the adjustment of a search type that is to be performed with 
respect to a new organization of data. Such factors include the file system that identifies the location 
of the data and its associated information, and the desired outcome of the search. Furthermore, given 
structural constraints such as functional dependencies, key dependencies, value constraints, and 
referential constraints defined in the database, the query can be rewritten to increase its efficiency. 
Considering and addressing these factors together can make a search more efficient. 

Summary 

[0004] In general, in one aspect, the invention features a method of joining two tables, ti and t2. Each 
table contains rows and columns and is divided into one or more partitions. The method includes 
25 calculating a correlation function between a first correlated value column of table ti and a second 
correlated value column of table t2. Following a query request, the method further includes a joining 
algorithm to join t\ and t2, using a reduced set of partitions based in part upon the correlation function. 
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[0005] Implementation of this invention may include one or more of the following. At each iteration 
of the joining algorithm, the method may calculate a set of partitions in each table to be joined. Each 
set of partitions may be defined by a starting partition number and by the number of partitions in each 
set, fi and f2, respectively. A join may be performed on the reduced set of partitions from each table. 
5 If there are more than two empty partitions, the number of partitions in each set may be increased. 
Thus the active partitions in the span of fi and f 2 partitions may be contiguous. At the next iteration, 
the starting partition number for each set of partitions may be incremented, and the join algorithm may 
be performed on the new set of partitions. The method may iterate until all partitions have been 
subjected to the joining algorithm. In another implementation, the algorithm may iterate until there are 
10 no more non-eliminated partitions. 

1^ [0006] The correlation function may be calculated by joining table ti to table t 2 using PK=FK as the 
O join condition to produce a join result having rows, each row including a value from cvj and a value 

y from CV2. In this implementation, PK may denote a primary key column in table ti, FK may denote a 

m 

^ foreign key column in table t 2 , cvi may denote a first correlated value column in table ti, and cv 2 , 

H : 5 denotes a second correlated value column in table t 2 . The method may create an initial running 

m 

s constraint comprising a null range. A derived constraint rule having the form 

M 

FU 



m 



i* 



[0007] (PK = FK) cv 2 + Ci < cvi < cv 2 + c 2 , 



[0008] Where ci and c 2 are constants, and "-»" means "implies 0 may be produced by computing a new 
constraint (NEW) having a range, and modifying RC by merging the range of NEW with the range of 
20 RC for each row in the join result. 



[0009] The number of partitions in each table to be joined at each iteration, fi and f 2 , respectively, may 
be calculated as follows. A parameter n may be calculated to be the number of contiguous partitions in 
table ti that may have rows matching rows in a single partition of table t 2 . The parameter n may be 
calculated from the equation n= pc 2 - pcj + 1. The parameters pcj and pc 2 may be called partitioning 
25 correlation coefficients. The partitioning correlation coefficient pci may be set equal to the value of 
(SIGN(ci)*CEILING(ABS(ci)/p)), and the partitioning correlation coefficient pc 2 may be set equal to 
the value of (SIGN(c 2 )*CEILING(ABS(c 2 )/p)). The maximum number of file contexts, m, may be 
calculated based upon the amount of available memory. The parameter f 2 may be set equal to the 
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smallest integer value that is equal to or greater than the value of ((m-n)/2), and the parameter fj may 
be set equal to n + f 2 -1 . 

[0010] At each iteration the starting partition numbers, Pi and P2, of each table to be joined may be 
calculated as follows. P2 may be set equal to a lowest partition number in table t 2 such that P2 may be 
5 a first active, non-eliminated partition in table t2, and at least one of the partitions in the interval 
between P2 - pc2 and P2 - pci in table ti may be an active, non-eliminated partition. Pi may be set 
equal to P2 - pc2. Pi and P2 may be updated during each iteration by finding a lowest partition number 
P* in table t 2 that is greater than or equal to the sum of P2 + f2 such that P 2 * is the first active, non- 
eliminated partition, and at least one of the partitions in the interval between P* - pc2 and P* - pci in 
10 table ti is an active, non-eliminated partition. P 2 may be set equal to P* , and Pi may be set equal to 

p [0011] The span of the fj partitions in table ti and the span of the f2 partitions in table t 2 may be 

p increased as follows. A parameter eps may be set equal to a minimum number of inactive or 

(fi eliminated partitions in the span of fi partitions in table ti beginning at Pi and in the span of f2 partitions 

ji 15 of table t2 beginning at P2. Alternatively, the parameters fi and f2 may be increased if the sum of the 

W inactive partitions in the span of fi partitions in table ti beginning at Pi and in the span of f2 partitions 

ru 

n\ of table t2 beginning at P2 is greater than or equal to two; in this case, eps would be this sum divided by 
j=f two and truncated. The parameters fi and f2 may be each increased by eps. The parameters fi and f 2 
may be further increased if the addition of eps creates extra empty partitions in the increased fj and f 2 . 

20 [0012] The joining algorithm includes creating a file context, which stores the location data for a row 
and a first value associated with the row, for each partition of the set of partitions to be joined. The 
joining algorithm also may determine the lowest first value stored by the file contexts that is equal to 
or greater than a particular hash value. Finally, the joining algorithm may identify rows with a 
particular first value by reading the file contexts. 

25 [0013] In general, in another aspect, the invention features a computer program, stored in tangible 

medium, for joining a first table ti and a second table t2. Each table may contain rows and columns, 

and each table may be divided into one or more partitions. The computer program comprises 

executing instructions that may cause a computer to calculate a correlation function between a first 

correlated value column of table ti and a second correlated value column of table t 2 . Following receipt 
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of a query requesting a join between table tj and table t 2 , the program may perform a joining algorithm, 
wherein the partitions containing the rows to be joined may be determined based at least in part upon 
the correlation function. 

[0014] The computer program may comprise executing instructions that may cause a computer to 
5 calculate, based at least in part upon the correlation function, a first number fi and a second number f 2 , 
wherein fi and f 2 denote the number of partitions of table ti and table t 2 , respectively, to be joined. The 
computer program also may determine based at least in part upon the correlation function, a first 
starting partition number Pi for table ti and a second starting partition number P 2 for table t 2 . The 
program may perform a joining algorithm, wherein a set of f 2 partitions of table t 2 starting at P 2 are 
10 joined with a set of f\ partitions of table ti starting at Pi. The algorithm may iterate until all partitions 
have been subjected to the joining algorithm. 

Li 
gas- 

Q [0015] In general, in another aspect, the invention features a system in which a first table ti may joined 

D 

m with a second table t 2 . Each table containing rows and columns and is divided into one or more 
p partitions. The system may include a massively parallel processing system comprising one or more 
nodes, a plurality of CPUs, with each of the one or more nodes providing access to one or more CPUs, 

m 

g and a plurality of processes. Each of the one or more CPUs may provide access to one or more virtual 
Jl* processes. Each process may be configured to manage data, including the partitioned database table, 

py 

fU stored in one of a plurality of data-storage facilities. A partitioned table access component may be 
pj configured to select rows from the table by calculating a correlation function between a first correlated 
hfO value column of table ti and a second correlated value column of table t 2 , to receive a query requesting 
a join between table ti and table t 2 , and to perform a joining algorithm, wherein the partitions 
containing the rows to be joined may be determined based at least in part upon the correlation function. 

[0016] Other features and advantages will become apparent from the description and claims that 
follow. 

25 Brief Description of the Drawings 

[0017] Fig. 1 is a block diagram of a node of a database system. 

[0018] Fig. 2 is a flow diagram of a table distribution process. 
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[0019] Fig. 3 illustrates an example of rows from two partitioned tables residing in a data storage 
facility. 

[0020] Fig. 4 is a flow chart for the joining of two tables using a correlation function. 

[0021] Fig. 5 is a flow chart of an algorithm for calculating a correlation function. 

5 [0022] Fig. 6 is a flow chart for an algorithm for joining two partitioned tables. 

[0023] Fig. 7 is a flow chart for an algorithm that calculates the number of partitions in each table that 
may be joined per iteration. 

[0024] Fig. 8 is a flow chart of an algorithm for selecting rows from a set of partitions. 
[0025] Fig. 9 is a flow chart of an algorithm for creating a file context for each partition in two sets of 



MR) partitions. 



W 

01 [0026] Fig. 10 is a flow chart of an algorithm for providing rows with a particular hash value. 

b 

jUL 

J [0027] Fig. 1 1 is a flow chart of an algorithm for updating the file contexts. 

3 

H= [0028] Figs. 12A, 12B, and 12C illustrate one implementation of an advancing algorithm based upon 

rtj 

p | correlation analysis of two tables to be joined. 

m 

p5 Detailed Description 

[0029] The partitioned table storage technique disclosed herein has particular application, but is not 
limited, to large databases that might contain many millions or billions of records managed by a 
database system ("DBS") 100, such as a Teradata Active Data Warehousing System available from 
20 NCR Corporation. Fig. 1 shows a sample architecture for one node 105i of the DBS 100. The DBS 
node 105i includes one or more processing modules 1 10i... N , connected by a network 115, that manage 
the storage and retrieval of data in data-storage facilities 120i...n. Each of the processing modules 
U0i...n may be one or more physical processors or each may be a virtual processor, with one or more 
virtual processors running on one or more physical processors. 

25 [0030] For the case in which one or more virtual processors are running on a single physical processor, 
the single physical processor swaps among the set of N virtual processors. 
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[0031] For the case in which N virtual processors are running on an M-processor node, the node's 
operating system schedules the N virtual processors to run on its set of M physical processors. If there 
are 4 virtual processors and 4 physical processors, then typically each virtual processor would run on 
its own physical processor. If there are 8 virtual processors and 4 physical processors, the operating 
5 system would schedule the 8 virtual processors against the 4 physical processors, in which case 
swapping of the virtual processors would occur. 

[0032] Each of the processing modules 1 IOi . n manages a portion of a database that is stored in a 
corresponding one of the data-storage facilities 120i...n- Each of the data-storage facilities 120i...n 
includes one or more disk drives. The DBS may include multiple nodes 1052...X in addition to the 
1 0 illustrated node 1 05 1 , connected by extending the network 115. 

[0033] The system stores data in one or more tables in the data-storage facilities 120i...n. The rows 
125i...z of the tables are stored across multiple data-storage facilities 120i... N to ensure that the system 
workload is distributed evenly across the processing modules 1 10i.,. N . A parsing engine 130 organizes 
the storage of data and the distribution of table rows 125i...z among the processing modules 110i...n. 
The parsing engine 130 also coordinates the retrieval of data from the data-storage facilities 120i...n in 
response to queries received from a user at a mainframe 135 or a client computer 140. The DBS 100 
usually receives queries and commands to build tables in a standard format, such as SQL. 

[0034] The rows 125i...z are distributed across the data-storage facilities 120i...n by the parsing engine 
130 in accordance with their primary index. The primary index defines the columns of the rows that 
are used for calculating a hash value. The function that produces the hash value from the values in the 
columns specified by the primary index is called the hash function. Some portion, possibly the 
entirety, of the hash value is designated a "hash bucket". The hash buckets are assigned to data-storage 
facilities 120i... N and associated processing modules 1 10i...n by a hash bucket map. The characteristics 
of the columns chosen for the primary index determine how evenly the rows are distributed. 

25 [0035] Fig. 2 shows one implementation of how the rows of a table are distributed. The table 200 
contains a plurality of rows and is stored in a plurality of data storage facilities 120m by the parsing 
engine 130, shown in Fig. 1. For example, two columns 210, 220 can be designated as the primary 
index when the table is created. The hash function is then applied to the contents of columns 210, 220 
for each row. The hash bucket portion of the resulting hash value is mapped to one of the data storage 
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facilities 120i-4 and the row is stored in that facility. For example, if the primary index indicates a 
column containing a sequential row number and the hash function is the sum of the value one and the 
remainder when the sequential row number is divided by four, the first eight rows will be distributed as 
shown in Fig. 2. 

5 [0036] Queries involving the values of columns in the primary index can be efficiently executed 
because the processing module 110 n having access to the data storage facility 120 n that contains the 
row can be immediately determined. For example, referring to Fig. 2, if values from row 2 are desired, 
the parsing engine 130 can apply the hashing function to determine that only processing module 1 IO2 
needs to be used. As another example, an equality join between two tables that have the same primary 
10 index columns is more efficient. All of the rows that need to be joined are found in the same data 
storage facility 120 n and no movement of information from rows between the data storage facilities is 

y, necessary. 

O 

[0037] While the primary index of a table can be chosen for equality joins, for example the order 

61 number column of an order table, additional design features can make range searches, for example a 

□ 

|J5 range of dates from the date column, more efficient. Referring to Fig. 3, a partitioned database storage 

^ facility 120 2 is shown. Rows from two partitioned tables 305 and 340 are organized within the storage 

s 

h* facility 120 2 in accordance with a row identification (row ED) that can include values associated with a 

m 

partition function as well as values associated with the hash function and a uniqueness value. The 

rows stored in the storage facility 1202 are ordered at a top level by the result of the partition function. 

u 

120 As a result, a first group of rows 310 has one partition function value. A second group of rows 320 has 
another partition function value. A third group of rows 330 has a third partition value. The groups 
310, 320, 330 are ordered by their partition values and are also known as partitions. 

[0038] The rows are also ordered within each partition. For example, the first partition 310 contains 
five rows. Those rows are stored within that partition 310 in the order of the hash result for each row. 

25 The hash result therefore acts as a sorting or ordering value. A uniqueness value is also maintained for 
each row. In one implementation, no two rows with the same partition and hash value in a table can 
have the same uniqueness value. The uniqueness values are determined when the rows are added to 
the table. For example, a sequential number (the next uniqueness number after the highest one 
currently being used) or any currently unused number can be used as the uniqueness value. If two 

30 rows are in the same partition and have the same hash value, their order is determined by their 
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uniqueness values, which by definition cannot be identical. The uniqueness value does not play a role 
in ordering rows that have different partition or hash values. In another implementation, uniqueness 
values are not assigned to the rows and the order of rows with identical hash values is not determined. 

[0039] A partition function can return a number for a row based on the range of values into which that 
5 row's value in a certain column falls. For example, if an order table in a database has the order number 
column as that table's primary index, the partition function can correspond to the month of the order 
date. In that situation, the rows of the order table would be distributed to storage facilities based on the 
result of applying the hash function to the order number. In each storage facility, the rows would be 
ordered based on a monthly range of dates. For example, the first partition 310 could include all rows 
10 for orders in January 2001. The second partition 320 could include all rows for orders in February 
2001. Within each partition the rows are in the order of the hash value and, where hash values are the 
same, in order by uniqueness value. Such a partitioned table could be efficiently searched on ranges 
by eliminating partitions from the required search. For example, if all orders for a certain product 
during a two-month period is desired, only two partitions would need to be checked for the specified 



Q5 product. The monthly range is just one example of a possible partition function. Any type of function 

jt can be used. 

pi 

X 

H* [0040] For some tables, even though rows within the same hash may occur in each of the partitions of 

i 

fy the table, the rows being joined in each table with the same primary index value may occur only in a 
2 few partitions. That is, there is a correlation between the partitioning columns (in particular, for range 
H20 partitioning) for the two tables for any particular value of the primary index columns. An efficient join 
algorithm can be developed by exploiting this semantic correlation between partitioning columns. 

[0041] One implementation of a semantic partitioned pair elimination algorithm improves performance 
under the following conditions: (a) both tables have partitioned primary indexes and are range 
partitioned on comparable columns (for example, date columns); (b) the join is a join on equality 
25 constraints (equi-join, semi-join, etc.); (c) the primary index columns of both tables are the join 
columns; and (d) the range partitioning columns are correlated as described above. 

[0042] A system that performs such an efficient join algorithm, shown in Figure 4, calculates a 
semantic correlation function (block 410) and performs a join using that function (block 420). 
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[0043] Semantic Correlation Analysis 

[0044] One method of calculating the semantic correlation function of block 410 is shown in Figure 5. 
A correlation function can be stated generally as deriving the constants c\ and c 2 in the following 
equation: 

5 [0045] d x +c, <d 2 <d x + c 2 , where d\ and d 2 are values in the range partitioning columns of two 
tables t\ and t 2 , respectively. 

[0046] Assuming the preceding relationship true for all values of the primary index column for some 
values of c\ and c 2 such that c 2 -c x is not excessively large, i.e., it does not span many range 
partitions, then there is reasonable correlation between the two tables. A correlation between a 
primary index value and its partitioning columns, however, is not required. 

[0047] In one example, it is assumed that two tables, namely ORDERS and LINEITEM, are to be 
joined. The ORDERS table has a unique primary index of 0_ORDERKEY and is range partitioned by 
0_ORDERDATE (corresponding to d\) with each partition covering one month. The LINEITEM 
table has a nonunique primary index of L_ORDERKEY and range partitioned by L_SHIPDATE 
(corresponding to d 2 ) with each partition covering one month. In some queries, these tables may be 
joined on their primary index columns. For any particular order in the ORDERS table, its order date is 
less than or equal to any of the ship dates for a line item of the order, and the ship dates occur within a 
relatively small range of time later than their corresponding order dates. For the disclosed example 
tables c\ is zero months and c 2 is four months, which means that the ship dates for each of the line 
items is between 0 months and 4 months after the order date: d x + 0 < d 2 < d x + 4 months. 

[0048] Assume that t 2 is the left and larger table (without loss of generality) for the binary join in the 
following example. Assume that t\ is the right table for the join. The above correlation between tl.dl 
and t2.d2 for equal values of the primary index columns can be rewritten as the following: 

[0049] d 2 -c 2 <d x <d 2 -c x 

[0050] A derived constraint rule (DCR) can typically be represented by the following equation: 

(PK=FK) -> (Date 2 + c x < Datei < Date 2 +c 2 ), 
where c\ 9 c 2 are constants and Datei and Date 2 are date columns in the PK and FK tables, respectively. 
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[0051] In one example, the optimizer initiates the DCR derivation process when a user issues a collect 
statistic statements on a date column, e.g., Datei, of a table that is either a primary key (PK) or a 
foreign key (FK) table and the related table also has a date column, e.g., Date 2 . The purpose of the 
correlation algorithm of block 410 is to derive the values of c\ and c 2 in the equation above. 

5 [0052] The correlation algorithm (block 410) illustrated in Figure 5, assumes that t\ and t 2 are PK-FK 
tables where Datei and Date 2 are date columns in the PK and FK tables, respectively. It is also 
assumed, without loss of generality, that both Datei and Date 2 are not nullable. 

[0053] The process begins by performing an inner join between t\ and t 2 using PK=FK as a join 
condition (block 500). The optimizer will choose the optimal join method. It is not necessary to write 
the join results to a spool, since the constraint between the two dates will be derived during the 
implementation of the algorithm. 

[0054] The algorithm next creates an initial constraint: Date 2 + c\ < Datei ^ Date 2 + c 2 where 
c\ = +oo and c 2 = - °o (block 505). This constraint is referred to as the initial "running constraint" (RC). 

[0055] For each row in the join result, the algorithm computes a new constraint (block 510). To 
compute this new constraint, the values for Datei and Date 2 in Ti and T 2 , respectively, are Di and D 2 , 
respectively, in the row being analyzed in the join result. Using these values, the constraint 
Date 2 + (Di-D 2 ) < Datei < Date 2 + (Di-D 2 ) is deduced. This new constraint is referred to as NEW. 

[0056] For each row in the join result, a new RC is computed by merging the old RC with NEW (block 
415). For the new RC, cl is the minimum of cl of the old RC and (D r D 2 ), and c2 is the maximum of 
c2 of the old RC and (D r D 2 ). The result (block 520) is a DCR of the following form: (PK=FK) -> 
(Date 2 + ci < Datei £ Date 2 + c 2 ) (block 520). 

[0057] To avoid needless imposition of overhead, the "usefulness" of a DCR is analyzed (blocks 525 

and 530). Assuming a uniform distribution of Datei and Date 2 , a DCR is most useful when ci-c 2 is 

minimized. Since both c\ and c 2 were computed from Datei -Date 2 in the algorithm described above, 

25 the range of values for both is from (Datei MrN -Date 2 MAX ) to (Datei MAX -Date 2 MIN ), referred to as Low 

and High, respectively. The usefulness of a DCR is measured as (ci-c 2 )/SIZE where SIZE is the 

interval size for the values of ci-c 2 , which is equal to (High-Low+1). The value of the usefulness 

function is between 0 and 1 with smaller values implying greater usefulness. In one implementation, 
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as a heuristic, the optimizer will save and maintain a DCR only if the usefulness value is less than or 
equal to 0.5. It will be understood that this threshold could be varied without departing from the spirit 
of what is disclosed herein. Note that the usefulness function can be extended for non-uniform 
distribution of one or both of the date columns using collected statistics on those columns. Returning 
5 to Figure 5, once the DCR is computed, the usefulness of the DCR is derived (block 525). If the 
usefulness of the DCR is less than a threshold, e.g., 0.5, the DCR is saved and maintained (block 530). 

[0058] Join Analysis 

[0059] A semantic query based analysis permits the use of a reduced set of partitions for successive 
iterations of the join. For example, by deriving the semantic correlation function, the entire table need 
v not be used when performing an iteration of the join; only the semantically-related partitions need be 
joined. Figure 6 depicts one implementation of the join of block 420. As shown in Fig. 6, the join 
may be comprised of multiple steps. First, the parameters f\ and^ are calculated (block 610), where f\ 
and^ denote the number of partitions in table t\ and t 2y respectively, that will be processed by the join 
algorithm in each iteration. Next, starting partition numbers, Pi and P2 for tables t\ and ti, 
respectively, are determined for the first iteration (block 620). A row hash match scan method 
algorithm is performed on the rows in a reduced set of partitions comprising^ partitions beginning at 
Pi against f\ partitions beginning at Pi (block 630). Finally, the method branches (block 640): if 
additional semantically-related partitions that are candidates for joining exist, then another set of 
partitions are joined, otherwise the algorithm concludes (block 650). 

[0060] One way to calculate the number of partitions used in the join algorithm at each iteration (block 
610) is shown in Fig. 7. This calculation is a two step process. First, the parameter n 9 which denotes 
the number of contiguous partitions in table t\ that may have rows matching a single partition in table 
t2 is determined (block 710). In such a fashion, the semantic information of the correlated columns is 
embedded in the parameter n according to the following equations: 

25 [0061] pc x = SIGNfo)* CEILING f ; 

I P J 

[0062] pc 2 = SIGN (c 2 ) * CEILING 
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[0063] n = pc 2 -pc x + 1 ; 

[0064] where the function SIGN returns the value -1 if the argument is less than 0, else it returns the 
value of 1, CEILING returns the smallest integer greater than or equal to the value of the argument, 
and ABS returns the absolute value of the argument. In another implementation, the function SIGN 
could return any value if the argument is zero. Note that the coefficient pci is Ci normalized with 
respect to the partition range size p, and pc2 is C2 normalized with respect to partition range size p. For 
example, both tables may be partitioned by months, by weeks, or by any other relevant partitioning 
range. Thus, pcj and pc 2 can be termed partitioning correlation coefficients. 

[0065] As previously noted, the value n represents the number of contiguous partitions in the right 
table (/i) that may have rows matching rows in a single partition of the left table (t 2 ). Using the above 
example based upon an ORDERS table and an LINEITEM table, n is 5 partitions, pc 2 is 4, and pc\ is 0. 

[0066] If n + 1 is greater than the parameter m, then another technique is needed to perform the join 
(such as the general spool, sort, and then do the row hash match scan). Otherwise, the modified direct 
row hash match scan algorithm is performed. The parameter m is the maximum number of file 
contexts and associated data blocks reasonable to use for the join without using an excessive amount of 
memory. The value of m would be determined based on the available memory and number of possible 
concurrent users. In one implementation, m is set to eight, which is the same value used in that 
implementation for sort operations that use multiple file contexts (this has been empirically proven to 
be a reasonable value of m). Each partition used in an iteration has an associated file context. A file 
context stores at least the location data for a row and a first value associated with the row for each 
partition that is populated by one or more rows. 

[0067] The number of partitions joined with a joining algorithm during an iteration is determined by 
the semantic relationship between one partition in table t 2 and a set of partitions in table t\ 9 and by the 
maximum amount of memory available for file context storage. The following equations illustrate one 
way to calculate the parameters f\ and f 2 (block 720) assuming a partition range of size p for both 
tables: 
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[0069] /,=*+/,-!. 



[0070] By calculating /i and / 2 in this fashion, at least one file context is allocated to the left table (t 2 \ 
and at least n file contexts are allocated to the right table (t\). Any excess file contexts (up to the 
maximum of m - n - 1) are divided evenly between the two tables. 

5 [0071] Once the number of partitions of each table to be joined is determined (block 610), the starting 
partition numbers are calculated (block 620). In one implementation Pi denotes the first partition in a 
set of f\ partitions in table t\ 9 and Pi denotes the first partition in a set of/2 partitions in table P2 is 
chosen to be the first active, non-eliminated partition in table ti such that at least one of the partitions 
in the range P 2 - pc 2 to P 2 - pc x in table t\ is an active, non-eliminated partition. Pi is then set to 

10 P 2 - pc 2 . An active, non-eliminated partition is a partition that contains at least one row and for which 



q constraints in the query have not specified that this partition does not need to be searched (for instance, 
conditions on the partitioning columns may allow the optimizer to determine that some partitions can 
be eliminated from the join). 



[0072] Following a determination of the actual span of partitions in each table to be joined during each 



■ 15 iteration (e.g., both the starting partition number and the number of partitions in each table to be joined 
fry are calculated), the join algorithm may be performed (block 630). As depicted in Fig. 8, one 
gf implementation of a process for selecting rows by hash value from a set of partitions for a row hash 
match scan algorithm includes several steps. The process can be used on both sets of partitions. First, 
the process creates a file context for each partition in each set of partitions specified in the join 
20 command to track hash values in that partition (block 805) (more particularly described with respect to 
Fig. 9). The process then determines the lowest hash value identified by the file contexts for the set of 
partitions and provides that value to the join algorithm (block 810). The join algorithm requests a 
reading of a hash value or an advance to a hash value (block 815). If the join algorithm requests a 
reading, rows in the set of partitions with that hash value are provided to the algorithm (block 820) 
25 (more particularly described with respect to Fig. 10). If the join algorithm requests an advance, each 
file context for the set of partitions is modified, if necessary, to identify the first row with the desired 
hash value or a greater hash value (block 825) (more particularly described with respect to Fig. 11). If 
file contexts remain (block 830), the lowest identified hash value is again provided to the join 
algorithm (block 810). If no file contexts remain (block 830), the process of selecting rows is finished. 
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In one implementation, the join algorithm receives a signal indicating that the set of partitions has no 
rows with hash values equal to or greater than the last requested hash value. 

[0073] One implementation of the process of creating file contexts for each partition is depicted in Fig. 
9 and involves examining the next partition (or the first partition at the very beginning of the process) 
5 in a set of partitions (block 905). If that partition is empty (block 915) and unexamined partitions still 
exist in the set (block 930), the next partition (block 905) is checked to see if it contains rows (block 
915). If that partition is populated with one or more rows (block 915), an empty file context for that 
partition is created (block 910). The associated partition is then examined to find the row containing 
that partition's lowest hash value or, in another implementation, to find the row containing that 
10 partition's lowest hash value greater than or equal to a given hash value (block 920). Once a value 
meeting the criteria is found in a row in the partition, location information for that row, as well as the 
jl hash value itself, are stored in that partition's file context (block 925). In one implementation, the file 
context is not created until the initial location data and hash value for it are determined. If unexamined 
Ly partitions still exist in the table (block 930), the next partition is examined (block 905), and the process 
g 4,5 is repeated. If all partitions in the set have been examined, the partitions in the set in the other table are 
examined (block 935). When all partitions in all specified sets of partitions have been examined, the 
process of creating file contexts is complete. 



La 



E 



[0074] In one specific implementation, the process of creating file contexts entails, first, an empty file 



ru 
ru 

01 context is created and the file system is requested to find a first data block with a hash value equal to or 

O 

|2j0 higher than 'hhhhhhhh' (hexadecimal notation) ignoring one or more high-order bytes that designate 
the partition. The value 'hhhhhhhh' is the desired starting hash value or '00000000' if it is desired to 
find the first hash value in the table. The join algorithm picks one of the two tables/spools in the join 
and finds its lowest hash value; then using that hash value, probes into the other table/spool. The file 
system positions to the first data block containing a row with equal or higher hash value than the 

25 desired hash value, reads that data block and then positions to the first row in that data block with an 
equal or higher value than the desired hash value. Another empty file context is created. The file 
system is requested to find the next row using that file context with a hash value equal to or higher than 
'hhhhhhhh' in a partition that is at least one higher than the partition number in the row identification 
(row ID) of the first row found. This is continued until no row is found creating as many file contexts 
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as needed in the process. The file contexts are linked together so that the file contexts can easily be 
traversed. 

[0075] Referring again to Fig. 8, the lowest hash value of the new file contexts for a set of partitions is 
determined. In one implementation of this step, the hash value in the first file context for the set is 
5 examined and stored as the "lowest" hash value. Then, the hash value in the next file context for that 
table is examined, and if this new value is smaller than the "lowest" hash value, the new value is stored 
as the "lowest" hash value. That step is repeated until every file context has been examined, at which 
point the "lowest" hash value has been determined. 

[0076] As depicted in Fig. 8, the join algorithm can request that rows having a particular hash value be 
10 read from a set of partitions. One implementation for reading rows is depicted in Fig. 10 and begins 
with selecting the next partition (or the first partition at the very beginning of the process) that 
identifies the hash value to be read (block 1005). The stored location data is used to provide the row 
data to the join algorithm (block 1010). The next row in the partition is then checked to see if it has 
the same hash value (block 1015). If it does, its data is also provided to the join algorithm. Those two 
iTs steps continue until a row with a different hash value is located or no more rows are in the partition. If 
more file contexts identifying the hash value to be read remain (block 1020), rows are provided using 
U the data in those file contexts (block 1005). If not, all the rows have been provided. In one case, the 
pt join algorithm can request that rows with the same hash value be read again. For example, insufficient 
available memory requires that if there are many rows with the same hash value, they are compared 



w 

m 

s 

01 



Q 



p0 piecemeal with rows loaded into memory in partial sets. 

[0077] Figure 1 1 depicts one implementation of the process of advancing each file context to the first 
row in the associated partition that has a hash value greater than or equal to a desired hash value (block 
825). The implementation begins with selecting the next partition (or the first partition at the very 
beginning of the process) that identifies a hash value less than the desired value (block 1105). If a 

25 partition identifies a hash value greater than or equal to the desired hash value, it does not need to be 
advanced, though in some implementations it could be (for instance, if residual conditions indicate that 
the found row is not applicable). Once a file context with a lesser hash value is identified, the rows in 
the associated partition are examined for the desired hash value (block 1110). If one or more such 
rows exist, the hash value and location data for the first such row is recorded in the file context (block 

30 1115). If not, the rows in the associated partition are examined for a hash value greater than the 
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desired hash value (block 1 125). If one or more such rows exist, the hash value and location data for 
the first such row is recorded in the file context (block 1130). If no rows with hash values greater than 
or equal to the desired hash value are present in the partition, the file context is removed (block 1 135). 
The remaining file contexts are then examined for identified hash values less than the desired hash 
5 value (block 1 120) and the process repeats for any such file contexts. 

[0078] By performing a row hash match scan algorithm, the rows are presented to the join algorithm in 
hash order skipping any rows with hash values not needed by the join algorithm. The join algorithm, 
therefore, sees the rows as if all rows were ordered by the row hash instead of being partitioned. 
Reading the rows in this fashion requires additional memory to contain the multiple contexts and 
10 associated data blocks. Additional CPU is needed to manage the multiple file contexts. The number 
of I/Os, however, should be approximately the same as when using a traditional primary index and the 
^ CPU time should be dominated by the join algorithm itself. The performance should only be slightly 



W 



less than with a traditional primary index. 
[0079] Partition Advancing AlRorithm 



0P5 [0080] Following a join between the semantically related partitions in each table, the algorithm 
y* performs another iteration (block 640) if an active, non-eliminated partition that has not been included 
ft in a previous set of partitions exists. In one implementation, a new set of partitions in each table are 

ru 

fin identified for joining. Referring back to Fig. 6, a starting partition number for each set of partitions 

0 

y must be determined for each iteration (block 620). In one implementation (assuming that Pi and P2 are 
20 the starting partition numbers in the previous implementation), P 2 * , a starting partition number for table 
t2 in the next iteration is chosen to be the first active, non-eliminated partition at P2 or higher in table t2 
such that at least one of the partitions between P*- pc 2 and P 2 * - pci, inclusively, in table ti is an 
active, non-eliminated partition. The starting partition number, P* , for table ti in the next iteration is 
then chosen to be P*- pc 2 . 

25 [0081] Figure 12 illustrates one implementation of the advancing algorithm of block 620. Figure 12A 

illustrates the initial configuration of one implementation. The following assumptions for this example 

are made in Figure 12: the left table is LINEITEM and the right table is ORDERS; c\ = 0 months and 

C2 = 4 months for the semantic correlation between OORDERDATE and L_SHIPDATE; m = 12; n - 

5,/2 = 4, and f\ = 8. In this implementation, the initial subset of partitions for ORDERS covers non- 
HOU01:680777.2 17 Express Mail No.:EL865848279US 

Date: December 26, 2001 



NCR 9983 

existent partitions. Additionally this example illustrates that some partitions in table t\ may be non- 
existent during the initial iteration of the method. Furthermore, some of the^ partitions at the end of t 2 
may be non-existent as well. 

[0082] Figure 12B illustrates the result of advancing to the next set of partitions. Here, it is assumed 
5 that the next partition in the left table t\ is a non-eliminated partition that contains data rows. If the 
next partition is eliminated or does not contain data rows, then the method merely goes to the first 
partition that contains data rows. The current^ partitions on the left table of Figure 12B are joined 
with the corresponding current f\ partitions on the right table t\ of Figure 12B using the join method of 
block 630. 

10 [0083] Figure 12C illustrates the advancement to the third and subsequent set of partitions. Here it is 
^ assumed that the next partition in the left table contains rows (otherwise just skip to the first partition 
O that does) and has not been eliminated. The current f 2 partitions on the left are joined to the 
y corresponding current f\ partitions on the right using the join method of block 630. 

H 

jp[ [0084] The number of partitions spanned in the left and right table may be increased when the 
Qp5 partitions are inactive or eliminated. More particularly, whenever there are initially two or more empty 
[a file contexts the span may be increased. In one implementation, the number of partitions spanned in 
the left and right table when the partitions are inactive or eliminated may be increased for an iteration. 
That is, the f\ and^ can be increased for each table by a parameter eps, which is the minimum of the 
rj number of inactive or eliminated partitions in the current span on the right table (t\) and the 
20 corresponding value for the left table (fc) to define a f\ and f{ for this iteration. If the span is 
increased, this can be repeated for additional partitions added to the span. In this fashion, there will be 
a total of f\ +fi partitions processed by the join algorithm for this iteration. 

[0085] In another example, the number of partitions in the span can be increased based upon the 
number of the inactive or eliminated partitions in the current span on the right table (t\) and on the left 
25 table {ti) according to the following equation: 



ru 

ru 
m 



[0086] FLOOR 
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[0087] where x is the sum of the number of inactive or eliminated partitions in the span of f\ partitions 
of table t\ beginning at Pi and in the span of f 2 partitions of table t 2 beginning at P 2 , and FLOOR 
returns the largest integer less than or equal to the value of the argument. If the span is increased, this 
can be repeated for additional partitions added to the span. In this fashion, there will be a total of/j' + 
fi partitions processed by the join algorithm for this iteration. 

[0088] The text above described one or more specific implementations of a broader invention. The 
invention also is carried out in a variety of alternative implementations and thus is not limited to those 
described here. For example, while the invention has been described here in terms of a DBMS that 
uses a massively parallel processing (MPP) architecture, other types of database systems, including 
those that use a symmetric multiprocessing (SMP) architecture, are also useful in carrying out the 
invention. As another example, an implementation has been described with the sorting value as a hash 
value that is also used for distributing rows among storage facilities. Other types of sorting values are 
also useful in carrying out the invention. Many other implementations are also within the scope of the 
following claims. 



HOU01:680777.2 



19 



Express Mail No.:EL865848279US 
Date: December 26, 2001 



